White Wine Quality Exploration by Josh See

Introduction

The goal of this project is to explore the chemical properties in a tidy data set of white wines and to summarize which properties are most closely related to the wine quality score.

Load the Data

Data summary

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Data dimensions

## [1] 4898   13

Feature names

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Data Structure

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The first column, X, is a unique index for each observation. It is not useful for the analysis, so it is best to remove it before proceeding to the next step.
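Dropping the index column could look like the sketch below. The data frame here is a tiny made-up stand-in, and the name `wines` is borrowed from the test output later in this report; the real loading step is not shown.

```r
# Toy stand-in for the wine data (values are illustrative only).
wines <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  quality = c(6L, 6L, 6L)
)

# X only numbers the rows, so remove it before any analysis.
wines$X <- NULL

names(wines)
```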

Univariate Plots Section

Observation

The quality of each wine is rated by wine experts on a scale from 0 (very bad) to 10 (excellent).

The lowest and highest quality scores given are 3 and 9, with a mean of 5.878.

All attributes are decimal values except quality, which is an integer.

Attributes such as fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates have a maximum far above the 75% quantile, which suggests outliers or a long right tail.

Let's take a look at each property to get a sense of its distribution.

Quality

Levels of wine quality as a factor

##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Summary of Quality as factor

##    0    1    2    3    4    5    6    7    8    9   10 
##    0    0    0   20  163 1457 2198  880  175    5    0

Mode of Quality Score

The mode of the quality score is 6, which matches the summary of quality_as_factor above.
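As a rough sketch of how the factor summary and the mode might be obtained, the snippet below uses made-up scores. `quality_as_factor` is the name used in this report; `counts` and `mode_quality` are hypothetical helper names.

```r
# Made-up scores standing in for the quality column.
quality <- c(5L, 6L, 6L, 7L, 3L, 6L, 9L)

# Keep every possible score (0 to 10) as a level so unused scores are counted as 0.
quality_as_factor <- factor(quality, levels = 0:10)
counts <- table(quality_as_factor)

# The mode is the level with the highest count.
mode_quality <- names(counts)[which.max(counts)]
mode_quality
```

Fixing the levels to 0:10 is what makes the empty scores (0, 1, 2 and 10) appear in the table with a count of zero.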

Histogram of Quality

I converted wine quality from an integer variable to a categorical variable (quality_as_factor), because the scores are somewhat arbitrary labels that could be represented by values other than integers. The histogram is roughly the shape of a normal distribution.

The range of possible scores is 0 to 10; in the dataset the minimum score is 3 and the maximum is 9. The mean is 5.878 and the median is 6, which are very close to each other.

Fixed.acidity

## 
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5 
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31 
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8 
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308 
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2 
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56 
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7 
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4 
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2 
##    8    2    3    1    2    2    1    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The initial histogram shows a right tail. Once the outliers at 11.8 and 14.2 are removed, the histogram looks approximately normal.
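One way the two extreme values could be dropped is with a manual cutoff, sketched below on made-up values. The cutoff of 11 is an assumption chosen only because it separates 11.8 and 14.2 from the rest; the report does not say how the outliers were actually removed.

```r
# Toy fixed.acidity values (made up), including the two extreme readings.
fixed_acidity <- c(6.3, 6.8, 7.0, 7.3, 8.1, 11.8, 14.2)

# Assumed cutoff: anything at or above 11 is treated as an outlier.
cutoff <- 11
trimmed <- fixed_acidity[fixed_acidity < cutoff]

summary(trimmed)
```

A rule-based alternative such as the 1.5 x IQR boxplot fence would remove many more points here (everything above 7.3 + 1.5 * 1.0 = 8.8 for this variable), so a manual cutoff fits the description better.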

Volatile.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.09700 -0.67780 -0.58500 -0.58090 -0.49490  0.04139

By transforming the variable with a base-10 logarithm, we are able to remove the right tail and obtain an approximately normal distribution.
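The transformation itself is a single call to `log10`. The sketch below applies it to the five quantile values from the first summary above rather than to the full column, which is enough to see how the second summary arises; `volatile_acidity` and `log_va` are hypothetical names.

```r
# Min, 1st quartile, median, 3rd quartile and max from the summary above.
volatile_acidity <- c(0.08, 0.21, 0.26, 0.32, 1.10)

# Base-10 log compresses the long right tail.
log_va <- log10(volatile_acidity)

round(log_va, 4)
```

Because the logarithm is monotone, the transformed minimum and maximum line up with the second summary (about -1.097 and 0.041).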

Citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Using a box plot, we can clearly see the outliers that cause the histogram to have a long right tail. Once we remove them, the histogram looks approximately normal.

Residual.sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218  0.2304  0.7160  0.6432  0.9956  1.8180

I tried removing the outliers after looking at the boxplot, but the histogram is still positively skewed. Even after applying a base-10 log transformation, the histogram does not look normal. It looks more like a bimodal distribution.

chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0460 -1.4440 -1.3670 -1.3680 -1.3010 -0.4609

Chlorides shows an approximately normal distribution once a base-10 log transformation is applied.

free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Once the outliers are removed, the histogram looks approximately normal.

total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The same holds for total.sulfur.dioxide: once the right-tail outliers are removed, the histogram looks approximately normal.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH histogram looks approximately normal as plotted, so no transformation is needed.

sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.65760 -0.38720 -0.32790 -0.32100 -0.25960  0.03342

Applying a base-10 log transformation gives an approximately normal distribution.

density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The density histogram looks approximately normal once the outliers are removed.

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##        3        4        5        6        7        8        9 
## 10.34500 10.15245  9.80884 10.57537 11.36794 11.63600 12.18000

This histogram shows a slightly positively skewed distribution with a peak between 9 and 10.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 observations and 13 variables in the dataset. The first variable, X, is an index of the observations and is removed because it is useless for the analysis. The next 11 variables are stored as numeric values and describe the chemical properties of the white wines. The last variable is an integer quality score, rated by wine experts on a scale from 0 (very bad) to 10 (excellent). The quality score is also converted to a factor. The most common quality score is 6 (2,198 of the 4,898 wines); the lowest and highest scores given are 3 and 9.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this analysis is the wine quality. The wine properties need to be analyzed to see whether they have an impact on this outcome.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Let's take a look at how strongly each variable is correlated with quality:

##                              [,1]
## fixed.acidity        -0.113662831
## volatile.acidity     -0.194722969
## citric.acid          -0.009209091
## residual.sugar       -0.097576829
## chlorides            -0.209934411
## free.sulfur.dioxide   0.008158067
## total.sulfur.dioxide -0.174737218
## density              -0.307123313
## pH                    0.099427246
## sulphates             0.053677877
## alcohol               0.435574715

We can see how the variables correlate with quality, ordered from the most negative to the most positive:

  • density (-0.31, the strongest negative correlation)
  • chlorides (-0.21, moderately correlated)
  • volatile.acidity (-0.19)
  • total.sulfur.dioxide (-0.17)
  • fixed.acidity (-0.11)
  • residual.sugar (-0.10)
  • citric.acid (-0.01)
  • free.sulfur.dioxide (0.01)
  • sulphates (0.05)
  • pH (0.10, weak positive correlation)
  • alcohol (0.44, the strongest positive correlation)

We will concentrate on the variables that show the strongest correlations.
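A single-column correlation table like the one above can be produced by passing the predictor columns and the quality column to `cor`. This is a minimal sketch on a made-up data frame with only two predictors; `predictors` and `cors` are hypothetical names.

```r
# Toy data (made up); the real data set has 11 predictors.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density = c(1.0010, 0.9940, 0.9951, 0.9930, 0.9915, 0.9900),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Correlate every column except quality with quality (Pearson by default).
predictors <- setdiff(names(wines), "quality")
cors <- cor(wines[, predictors], wines$quality)

# Sort by absolute strength so the strongest relationships come first.
cors[order(-abs(cors[, 1])), , drop = FALSE]
```

Sorting by absolute value is a convenience added here; it makes it easier to pick the strongest variables regardless of sign.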

Did you create any new variables from existing variables in the dataset?

Yes. Since the mode of the quality score is 6, we can treat 6 as the average on the 0 to 10 scale and cut the scores around it.

## 
##  (0,5]  (5,6] (6,10] 
##   1640   2198   1060

So we have three groups of wines after the cut. The first group has quality scores up to 5, which we consider the bad quality group. The second group has a score of 6, the average quality group. The last group, with scores from 7 to 10, is the best quality group. We will use these groups in our analysis.
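The interval labels in the table above match what `cut` produces with break points at 0, 5, 6 and 10. A small sketch on made-up scores follows; the name `quality_cut` is hypothetical.

```r
# Made-up quality scores.
quality <- c(3L, 5L, 6L, 6L, 7L, 9L)

# Right-closed intervals: (0,5] is bad, (5,6] is average, (6,10] is best.
quality_cut <- cut(quality, breaks = c(0, 5, 6, 10))

table(quality_cut)
```

Because the intervals are closed on the right by default, a score of 5 falls in the bad group and a score of 6 falls in the average group.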

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I created histograms to understand the distribution of the features and box plots to find the outliers. Yes, there were a few outliers, which I removed so that the distributions look closer to Gaussian (normal). I also applied a base-10 log transformation to the features with long tails so that they become closer to Gaussian. quality and pH were the features to which I did not apply any transformation, as they already look normally distributed. The alcohol feature looks positively skewed.

Bivariate Plots Section

I plotted a ggpairs scatterplot matrix to look at the relationships between the variables. I focus on four variables: alcohol, density and chlorides, which have the strongest correlations with quality, and pH, which has the largest positive correlation after alcohol. But let's plot all the other variables as well.

Boxplots of Quality vs All Other Variables

Overall mean and mean by quality score of the four selected variables

Overall mean and mean by quality score of alcohol

## [1] 10.51427
##        3        4        5        6        7        8        9 
## 10.34500 10.15245  9.80884 10.57537 11.36794 11.63600 12.18000

Overall mean and mean by quality score of pH

## [1] 3.188267
##        3        4        5        6        7        8        9 
## 3.187500 3.182883 3.168833 3.188599 3.213898 3.218686 3.308000

Overall mean and mean by quality score of density

## [1] 0.9940274
##         3         4         5         6         7         8         9 
## 0.9948840 0.9942767 0.9952626 0.9939613 0.9924524 0.9922359 0.9914600

Overall mean and mean by quality score of chlorides

## [1] 0.04577236
##          3          4          5          6          7          8 
## 0.05430000 0.05009816 0.05154633 0.04521747 0.03819091 0.03831429 
##          9 
## 0.02740000

From the boxplots above, we can understand more about the relationship between quality and the other variables. In the alcohol vs quality plot, we can see that higher quality scores tend to come with higher alcohol levels. Alcohol has an overall mean of 10.514; the mean by quality score dips to 9.809 at a score of 5 and rises steadily for scores above 5. The same goes for pH: its mean starts to increase when the score is above 5. density and chlorides, on the other hand, show a negative relationship with quality score. Their overall means are 0.994 and 0.046 respectively, and both means by quality score tend to decrease once the score is above 5.
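The overall mean and the per-score means shown above can be computed with `mean` and `tapply`. The sketch below uses a made-up data frame; `overall_mean` and `mean_by_quality` are hypothetical names.

```r
# Toy data (made up): alcohol level and quality score for six wines.
wines <- data.frame(
  alcohol = c(9.0, 10.0, 10.5, 11.5, 12.0, 12.4),
  quality = c(5, 5, 6, 6, 7, 7)
)

# Mean over all wines.
overall_mean <- mean(wines$alcohol)

# Mean within each quality score; the result is named by score.
mean_by_quality <- tapply(wines$alcohol, wines$quality, mean)

overall_mean
mean_by_quality
```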

We can verify this using a scatter plot.

Quality vs Alcohol and Quality vs pH

Quality vs Density and Quality vs Chlorides

We already saw a tendency in the boxplots; it is better illustrated with a scatter plot and a linear regression line. Good wines tend to have a higher alcohol level and a higher pH.

We explore further to see how the main features relate to the other, secondary features.

Alcohol vs Other Features

We can see that density, residual.sugar, chlorides, total.sulfur.dioxide and free.sulfur.dioxide are negatively related to alcohol level. Only pH shows a positive correlation.

pH vs fixed.acidity and pH vs residual sugar

## [1] -0.4258583

## 
##  Pearson's product-moment correlation
## 
## data:  wines$pH and wines$residual.sugar
## t = -13.8472, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2209387 -0.1670352
## sample estimates:
##        cor 
## -0.1941335

I would certainly expect fixed.acidity to be correlated with pH (the correlation is -0.426). In chemistry, when acidity goes down, the pH value goes up. There is no clear pattern between residual.sugar and pH, even though they are negatively correlated (-0.194).
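The Pearson test output above comes from `cor.test`. A minimal sketch on made-up values is shown below; the vectors and the name `result` are hypothetical.

```r
# Toy values (made up): pH tends to fall as fixed acidity rises.
fixed_acidity <- c(6.0, 6.5, 7.0, 7.5, 8.0, 8.5)
pH <- c(3.40, 3.30, 3.32, 3.15, 3.10, 3.05)

# Pearson's product-moment correlation is the default method.
result <- cor.test(pH, fixed_acidity)

result$estimate   # the correlation coefficient
result$conf.int   # the 95 percent confidence interval
```

Calling `cor` alone returns just the coefficient, which is how a bare number such as -0.4258583 would be printed.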

free.sulfur.dioxide vs total.sulfur.dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wines$free.sulfur.dioxide and wines$total.sulfur.dioxide
## t = 54.6447, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

Besides the main and secondary features, other features such as free.sulfur.dioxide and total.sulfur.dioxide are correlated with one another.

Residual Sugar vs Density

## 
##  Pearson's product-moment correlation
## 
## data:  wines$density and wines$residual.sugar
## t = 107.8749, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

When the top 1% of observations is excluded, plotting residual.sugar against density shows a strong correlation.
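Excluding the top 1% can be done by filtering on the 99th percentile of each variable. The sketch below is one possible approach on made-up data, not necessarily how the plot in this report was built; all names are hypothetical.

```r
# Made-up data: 99 ordinary wines plus one extreme residual sugar value.
residual_sugar <- c(seq(1, 20, length.out = 99), 65.8)
density <- 0.99 + residual_sugar * 0.0004

# Keep observations at or below the 99th percentile of both variables.
keep <- residual_sugar <= quantile(residual_sugar, 0.99) &
  density <= quantile(density, 0.99)
trimmed <- data.frame(residual_sugar, density)[keep, ]

nrow(trimmed)
cor(trimmed$residual_sugar, trimmed$density)
```

Trimming only affects which points are shown or summarized; the Pearson test above was run on all 4,898 observations (df = 4896).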

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Our main goal was to find out which features affect the quality score. The main relationships observed were between alcohol and quality and between pH and quality. Besides that, density and chlorides have a negative influence on quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The other features that show an interesting relationship are free.sulfur.dioxide and total.sulfur.dioxide, with a correlation of 0.616. They are correlated because total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide gas (SO2), so the free amount is a component of the total.

What was the strongest relationship you found?

The strongest relationship found was between density and residual.sugar, with a correlation of 0.839.

Multivariate Plots Section

Residual Sugar vs Density vs Alcohol

Since density and residual.sugar have the strongest relationship, I explored them further together with alcohol. From the plot, it appears that as density and residual.sugar increase, the alcohol color turns darker, which means alcohol decreases.

Residual Sugar vs Density vs Quality Cut

When I swapped alcohol for my newly defined quality cut variable, the result was more obvious. We can see that wines with better quality scores are concentrated on the left-hand side of the plot, whereas wines with bad quality scores are on the right.

pH vs fixed.acidity vs Quality

Initially, I thought pH had a little correlation with quality, so pH vs fixed.acidity should be somewhat linked with quality. After plotting it, there seems to be no strong pattern that we can identify from pH vs fixed.acidity vs quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between density, residual.sugar and alcohol seems quite interesting. Since all of them are highly correlated with one another, it is easy to spot the changes as any one of the variables varies. When we swapped alcohol for the quality score, the same pattern could be spotted as well.

Were there any interesting or surprising interactions between features?

As I mentioned above, I thought pH would show some interaction with quality. But after plotting it, it is hard to identify any from the plot.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No.

Final Plots and Summary

Plot One

Description One

When the histogram is faceted by the quality score cut, we can see that bad quality wines show a positively skewed distribution concentrated at lower alcohol levels. Average quality wines show more of a normal-shaped distribution of alcohol. Best quality wines show a negatively skewed distribution concentrated at higher alcohol levels.

Plot Two

Description Two

This graph shows that as density and residual sugar increase, alcohol level decreases. This can be clearly seen in the best quality wines, where the color goes from light to dark. The same color variation applies to the average and bad quality wines, but it is not as pronounced.

Plot Three

Description Three

In this graph, using the newly defined cut of the quality score, we can see that the best quality wines (scores 7 to 10) are concentrated in the lower right quadrant, where density is low and alcohol level is high. The bad quality wines (scores 0 to 5) are concentrated in the upper left quadrant, with high density and low alcohol level.


Reflection

There are two difficulties that I encountered during the analysis. One is the lack of categorical variables. In the dataset, the only variable we can use as categorical is quality; I wish there were more such variables, as they would allow us to identify more relationships between the variables using subsets. Two is the lack of correlation between variables. Some variables show very low correlation with any other variable.

Even with the limitations of our dataset, we were still able to discover interesting findings, such as the relationship between alcohol, density and residual.sugar. From the scatterplot matrix, I was only able to identify a few variables whose influence on the quality score could be understood.

There are many other factors that can determine a good wine. Things like smell and flavour, rather than only chemical properties, could be documented in the dataset to allow us to explore further what makes a good quality wine.